MicNet toolbox: Visualizing and unraveling a microbial network

Applications of network theory to microbial ecology are an emerging and promising approach to understanding both global and local patterns in the structure and interplay of these microbial communities. In this paper, we present an open-source python toolbox which consists of two modules: on one hand, we introduce a visualization module that incorporates the use of UMAP, a dimensionality reduction technique that focuses on local patterns, and HDBSCAN, a clustering technique based on density; on the other hand, we have included a module that runs an enhanced version of the SparCC code, sustaining larger datasets than before, and we couple the resulting networks with network theory analyses to describe the resulting co-occurrence networks, including several novel analyses, such as structural balance metrics and a proposal to discover the underlying topology of a co-occurrence network. We validated the proposed toolbox on 1) a simple and well described biological network of kombucha, consisting of 48 ASVs, and 2) we validate the improvements of our new version of SparCC. Finally, we showcase the use of the MicNet toolbox on a large dataset from Archean Domes, consisting of more than 2,000 ASVs. Our toolbox is freely available as a github repository (https://github.com/Labevo/MicNetToolbox), and it is accompanied by a web dashboard (http://micnetapplb-1212130533.us-east-1.elb.amazonaws.com) that can be used in a simple and straightforward manner with relative abundance data. This easy-to-use implementation is aimed to microbial ecologists with little to no experience in programming, while the most experienced bioinformatics will also be able to manipulate the source code’s functions with ease.


Introduction
Microbiomes are not a mere collection of independent individuals, but rather, ensembles of intricate constituents, biotic and abiotic, that create highly complex systems where emergent interactions, structures, and functions are crucial for the survival and performance of the whole. The inference of microbial co-occurrence networks may help in understanding PLOS ONE PLOS ONE | https://doi.org/10.1371/journal.pone.0259756 June 24, 2022 1 / 28 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Table 1. Description of several network metrics and properties currently used in biological networks, including some of their prospective interpretations.

References
Total Nodes/Vertices Total Entities within a network. Total number of taxa (species, OTUs/ASVs) in a network (species richness); number of connected taxa; common measure of ecosystem state in response to perturbations [4,[28][29][30] Edges, Links, Relationship, connection Relationship or associations between nodes. For cooccurrence networks, relationships exist between pairs of nodes.
Ecological associations, including interspecific interactions, niche overlap, cross-feeding, abiotic cooccurrence drivers, among others [1,4,31,32] Density, Connectance, Complexity (network scale), Interactions diversity, Probability of connection Fraction of edges that are actually present in the network with respect to all possible edges.
Reflection of the incidence of ecosystemic processes; Possible measure of ecological resilience; organization level of the community; measure of complexity in the microbial network [4, 28, 33- [38][39][40] Average degree, Complexity (taxon scale), Connectedness (normalized degree) Average number of edges connected to a node; average number of neighbors for a given node.
Measure of complexity in the microbial network [4,27,35,36,41] Ratio of positive and negative relationships. If > 1 there are more negative interactions, if < 1 there are more positive interactions present in the network.
Potential measure of cooperation level within the community; measure of community stability (ecological resilience and resistance) [3,4,36,41,[48][49][50][51] Average shortest path length (AL), Average geodesic path Average number of steps in the shortest paths from one node to another. It is calculated for all pairs and then averaged.
Microbial networks usually present small AL: measure of network's response speed to perturbations (ecological resilience); community cohesion; measure of information and substance flow [4,35,37,42,[52][53][54] Diameter, Longest geodesic path Length of the longest finite geodesic path anywhere in the network.
Measure of information and substance flow [27,36,42,54] Small world index (SW) Index based on a tradeoff between high clustering coefficient and short path length, the defining characteristics of small-world networks. Networks with SW > 1 are said to have more "small-worldness".
Microbial network topological property. Small-world microbial networks suggests that any two members in the community could interact with each other through a few intermediaries. [6,55] Clustering coefficient, Transitivity Average probability that two nodes neighbors of a third node are also connected between each other.
Motifs can be relevant in information flow (e.g., quorum sensing); potential biomarkers for microbiome perturbed state.
In an attempt to capture the most relevant information from a microbial community in their co-occurrence network and to try to overcome some common issues, we have developed the MicNet toolbox, an open source code to create, analyze and visualize microbial co-occurrence networks. We implemented UMAP [64], a dimension reduction algorithm which has been previously used to identify unique clusters of data in several genomic projects [65,66], given that it is both scalable to massive data and able to cope with high diversity [64]. Moreover, we coupled UMAP with different types of projections and HDBSCAN [67], an unsupervised clustering algorithm, able to identify both local and global relationships, as well as filtering out noise. Finally, we used and enhanced version of SparCC, a compositionally aware algorithm, to infer correlations for network construction [17,22]. Additionally, the toolbox includes several analyses of network theory to inspect the topological properties, robustness, structural balance, communities, and hub nodes that arise in microbial co-occurrence networks. The development of the MicNet toolbox, as an integration of several analyses, attempts to provide an easy-to-use and straightforward implementation towards a comprehensive description of potential local and global patterns for a better understanding of microbial community systems.

Input data
MicNet toobox input data consists of relative abundance/compositional datasets from highthroughput sequencing methods, such as metagenomics or metabarcoding. Prior to building an abundance data table, raw assembled sequences should be OTU/ASV clustered. MicNet toolbox currently supports abundance data as .tsv files (separated by tabs) or .csv files (separated by commas). In the web dashboard, input abundance data table is filtered by default, removing singletons (< 5 total counts among all samples) and unique (only appearing in one sample) entries. If the user desires, singleton filtering could be deactivated. For SparCC and UMAP/ HDBSCAN the first column of the table should contain the OTU/ASV ID and the following columns the abundance data. Taxonomic information is optional in the input abundance table since poor taxonomic assignment might hinder the interpretation of results. Hence, the user might prefer to work only with ASV/OUT IDs. If the user wishes to include taxonomic information in the resulting output, taxonomic annotation for the given OUT/ASV can be included in the second column, with ";" as a delimiter between each taxonomic hierarchy (e.g., Bacteria;Cyanobacteria; Cyanophyceae;Nostocales;Nostocaceae;Nostoc). In the case of network analyses, the correlation matrix output by SparCC should be input alongside the UMAP/HDBSCAN output datafile.

Data visualization
UMAP and HDBSCAN implementation. A common first step when visualizing highdimensional data is applying a dimensionality reduction technique. In this toolbox, we Overview of MicNet toolbox. MicNet Toolbox was designed for visualizing, creating and analyzing microbial networks obtained from compositional data (obtained through high-throughput sequencing methods such as metagenomic/16S surveys). Data filtering for singletons and low abundance taxa/ASV/OTU is supported if required. MicNet Toolbox includes two independent main modules: a data visualization module which uses UMAP and HDBSCAN to find local patterns in the data, and a network analysis module which implements an enhanced version implemented UMAP (Uniform Manifold Approximation and Projection), a non-linear dimension reduction technique that favors local data preservation, rather than global data, allowing a better identification of finer scale patterns [64]. We coupled UMAP with HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a hierarchical clustering algorithm that partitions the data based on their density [67,77]. This clustering technique has been shown to perform well when performed in combination with UMAP dimension reduction [78] and it has been tested with several dataset types, including genetic data, grouping genes into the correct known classes [79]. Thus, we implemented HDBSCAN on the data obtained from UMAP analysis, which represents the abundance data in a reduced space of two dimensions. HDBSCAN analysis not only classifies OTUs/ASVs as belonging to a cluster but also as noise (defined as any point that was not selected in any of the clusters) and outliers (detected with the GLOSH outlier detection algorithm, which works with local outliers).
When running UMAP we set as default a minimum distance of 0.1, number of components of 2, a Helligner output metric and number of neighbors of 15. In the case of HDBSCAN, the default parameters when running MicNet are Bray-Curtis metric, minimum cluster size of 15, minimum sample size of 5 and quantile limit of 0.9 for outlier detection. For our different datasets, the number of neighbors (UMAP), and minimum cluster/sample size (HDBSCAN) parameters were set depending on the input microbiome dataset. We used Bray-Curtis dissimilarity as the distance metric for UMAP, as it is a standard metric for biological datasets. In the web dashboard, UMAP and HDBSCAN parameters can be modified by the user to visualize the results according to each set of microbiome data.

New implementation of SparCC
To obtain the correlation matrix from relative abundance data, we implemented a modified version of the SparCC algorithm, a robust approach to discard spurious correlations when dealing with compositional data [17,22]. Although the original SparCC algorithm was not altered, several modifications were made to improve and scale the SparCC estimation matrix. We made three main changes to the original code; first, the code changed from Python version 2.7 to 3.9. Second, we use Numba and Dask in some parts of the matrix processes, namely functions or operations, with two main improvements: parallelization of operations and scalability in the size of the estimated matrices. Finally, the original SparCC version stores each estimation step in RAM, as arrays in NumPy. Although storing in RAM is efficient for small data sets, with large data the required memory increases rapidly depending on the interaction numbers and the size of the dataset. Thus, we store each estimation step on disk as hdf5 binary format. These changes made it possible to calculate the SparCC estimates with good time performance in easily accessible computing resources. SparCC p-value test on the inferred correlation was not modified, it was calculated with a Monte-Carlo simulation (with default n = 50) as done by Friedman & Alm (2012) [17] and the default value is to calculate one-sided p-values, although this can be modified by the user.
To set SparCC parameter values, we perform a parameter search in our more complex study case from the communities reported by Espinosa-Asuar et al. (2021) [80]. We performed an independent parameter sweep on each parameter, varying the exclusion threshold from 0 to 1 in steps of 0.1, the number of iterations from 10 to 100 in steps of 10 and the of SparCC to create a co-occurrence network; network analyses such as topology comparison, and community analysis are included in the aforementioned network analysis module. If desired by the user, HDBSCAN clustering output can be integrated into the network analysis module.
https://doi.org/10.1371/journal.pone.0259756.g001 exclusion number from 10 to 100 in steps of 10. For each parameter value, we calculated the number of correlations found and selected the parameter value when this number stabilized (S1 Fig). The final parameter values used in the databases presented here were: 50 iterations, an exclusion number of 10, exclusion threshold of 0.10 and 100 simulations for p-value calculation. However, this can be modified by the user both in the dashboard and when running the code from the github repository.

Network analyses
Network analyses were performed to characterize both the overall structure and the local interactions of the microbial co-occurrence network, in which each OTU/ASV is represented as a node and the correlations found by SparCC as undirected weighted edges, such that an edge between two nodes implies a relationship between the two corresponding OTUs/ASVs. Given that most network analyses can only handle positive interactions, we normalized the SparCC correlation matrix from -1 to 1 to a range from 0 to 1, except for the structural balance analysis which directly uses the positive and negative correlation values. It is important to note that the normalization of values does not change the distribution of the correlation values, they are just mapped to another scale to allow running network analyses that do not handle negative values. All available network analyses in the MicNet toolbox are described as follows.
Network topology comparison. Networks have several large-scale structural measurements to characterize their topology. For this purpose, MicNet calculates the following structural metrics: 1) network density, using networkx function nx.density, 2) average degree, calculated as the mean of all nodes degree using numpy mean function, 3) degree standard deviation, using numpy std function, 4) ratio of positive-negative relationships calculated simply as the number of positively weighted edges divided by the number of negatively weighted edges, 5) average shortest path length using nx.average_shortest_path_length, 6) clustering coefficient nx.average_clustering function, 7) modularity, calculated with function nx.modularity, using as network modules those obtained with the nx.greedy_modularity_comminuties algorithm, and 8) the diameter, which was calculated using nx.diameter function. Finally, we have added a custom function that calculates a small-world (SW) index as suggested by Humphries & Gurney (2008) [55]. The SW index is calculated as: Where l and cc are the average shortest path length and clustering coefficient of the experimental co-occurrence matrix, respectively. Analogously, l rand and cc rand are the average shortest path length and clustering coefficient, accordingly, of a comparable random network with the same number of nodes and density. The random network was built using the function nx.erdos_renyi_graph. This is done several times, with default n = 50, and the mean value of SW is returned.
MicNet includes the computation of the distribution of several of this large-scale metrics under the assumption that the underlying topology is: 1) a random Erdos-Renyi network [81] built using function nx.erdos_renyi_graph, 2) a small world Watts-Strogatz network [82] built using nx.watts_strogatz_graph function, or 3) a scale-free Barabási-Albert network [83] built using nx.barabasi_albert_graph function. A short description of these canonical topologies can be found in Fig 2. This allows the comparison of the query data against the different topologies. These simulated networks are built with the same number of nodes, density and average degree as the experimental data, and correlations are drawn from a uniform distribution from -1 to 1. Finally, the simulated networks are made symmetrical to be comparable with the SparCC output correlation matrix.
Degree distributions can also be used to discriminate between network topologies. Thus, we have included in the MicNet toolbox a function that plots the Complementary Cumulative Distribution Function (CCDF) of the degrees of the given network and compares it with the CCDF of a simulated comparable random, scale-free and small-word network on a log-log scale. To calculate the CCDF we first divide the range of the degrees into bins; for each bin we obtain its probability as frequency/total. We used this discrete definition of the Probability Density Function (PDF) to calculate the Cumulative Density function (CDF) as the cumulative sum of the PDF: where x k is the PDF of each bin k previously defined, such that the CCDF is calculated as: We used CCDF since it has been suggested as an easier way to visualize the difference between degree distributions [84,85].
Community analysis. To analyze subnetworks, we used two ways of dividing the network into subunits: 1) We used Louvain method to detect communities in networks (using pythonlouvain library [86]), and 2) we used the clusters found with clustering algorithm HDBSCAN. Each subnetwork's nodes and edges were isolated as a subnetwork using function nx.subgraph, and then for each we obtained the following metrics (also used for network topology analysis): total number of nodes, total number of relationships, density, average degree, and clustering coefficient. Finally, we characterize the diversity of each subnetwork by looking at the total number of different taxa present in each subnetwork at the phylum level.
Percolation analysis. Depending on the network structure, networks can be more or less robust to disruptions. Networks are usually formed by a giant component, which includes between 50% and 90% of the nodes. The formation and dissolution of this giant component is called percolation transition in network theory [87]. The percolation approach consists of removing nodes and their corresponding edges and analyzing how much the network's properties are disrupted [88]. The percolation simulation implemented in MicNet consists of n iterations; in each iteration a percentage of the nodes (with default value of 0.1, but this can be specified by the user) is removed along with all of their edges. After removing the nodes and corresponding edges, the following metrics are calculated for the remaining network: density, average degree, number of connected components (this last one calculated using the nx.connec-ted_components function), size of giant component, fraction of nodes belonging to the giant component, the communities found by the python-louvain algorithm and the network modularity. We implemented several percolation approaches: 1) random percolation, in which nodes are removed randomly; 2) centrality percolation, in which nodes are removed by centrality (whether degree, closeness or betweenness centrality), higher values first; and 3) group percolation, where groups of nodes are removed according to a grouping variable provided, such as taxonomic groups or HDBSCAN groups. Consequently, network robustness to different types of disruptions could be assessed by looking at changes in different network metrics.
Structural balance analysis. Structural balance analysis finds all triangle motifs in the network, that is, nodes that are interacting in triads, and then classifies them as balanced based on the simple analogy that ''my friend's friend is my friend" and ''my friend's enemy is my enemy" [89,90]. This leads to classifying triads of interactions as balanced if they meet this criterion, or as imbalanced otherwise Fig 3. A network is considered to be balanced if most triads found in it are balanced. To calculate structural balance, we found all triads in the network using function nx.cycle_basis, and keeping only the cycles of length three. Then, we classified the found triangle motifs into balanced or imbalanced, depending on their mutual correlations. The output of the analysis is a percentage of balanced and imbalanced triangles with respect to all triangles found, and the exact percentage for each of the four types of triangles displayed in Fig 3. Potential key taxa analysis. Four different centrality measures were implemented to characterize each node (OTU/ASV): degree centrality using function nx.degree_centrality, betweenness centrality using function nx.betweenness_centrality, closeness centrality using function nx.closeness_centrality and PageRank using function nx.pagerank. When running the code, a dataset is returned with each centrality metric for each node.

Dashboard interface
In addition to the freely accessible source code at the github repository, we have also developed a web dashboard at http://micnetapplb-1212130533.us-east-1.elb.amazonaws.com that can be used to run most of the analyses presented here. The dashboard consists of three main parts: 1) UMAP and HDBSCAN, 2) SparCC and 3) Network analyses. In the first component, the raw abundance data should be input (see Input Data section), and several parameters, as well as normalizations, for UMAP and HDBSCAN can be modified as desired. This first component returns an interactive visualization of the UMAP plot along with HDBSCAN clusters identified by color. Both the resulting plot and the file detailing cluster belonging can be downloaded from the dashboard.
The second component is for the estimation of the co-occurrence network with SparCC. In this section, abundance data should be input, and parameters can also be adjusted by the user. The resulting correlation matrix can be downloaded from the dashboard. Finally, the third component includes the post-processing analyses of the co-occurrence network. Thus, the input of this section should be the matrix obtained from the previously defined SparCC component or any square correlation matrix, and the UMAP/ HDBSCAN output file. When run, this section allows the user to obtain the large-scale metrics of the network, the structural balance percentages, descriptions of the communities found, and two network graphs where the size of the node indicates degree centrality, green edges indicate positive relationships between two nodes, whereas red edges indicate negative ones. Finally, in the graph named HDBSCAN the color of the nodes refers to the HDBSCAN cluster they belong to; whereas in the graph named Community the colors indicate the color of the community they belong to, based on Louvain clustering algorithm. When the network of interest consists of less than 500 nodes an interactive visualization plot is deployed, but for larger networks a static plot is returned, given limited computational resources.
For the other network analyses presented, such as percolation analysis and topology comparison, or if the user wishes to run the complete pipeline directly with code, we have also provided a package called micnet that can be install via pip. All the functions necessary to run the complete example of Kombucha are available in the micnet package and their usage is explained in a notebook present in the github repository of the MicNet toolbox. It should be noted that, if more computational resources are needed, the dashboard itself can also be run after downloading the MicNet toolbox from github, creating the conda environment and deploying it with streamlit as suggested in the readme file.  [17], which consists of 50 OTUs in 200 samples drawn from a multinomial log-normal distribution. For this, we compared the real correlations to the estimations performed by SparCC, and then we calculated the RMSE (Root Mean Square Error). Secondly, to corroborate that our implementation of SparCC does indeed scales better to larger datasets than the previous version, we compared the execution times and the RAM consumption between our new version and the previous version of SparCC for datasets containing from 50 to 2600 OTUs.
Biological validation: Kombucha consortium. To further authenticate MicNet Toolbox's approach to analyze microbial co-occurrence networks, we needed to see if biological interactions previously described by experimental work could be replicated in the network. For this, we make use of the kombucha dataset described in Arıkan et al. (2020) [91]. Test data was downloaded from the European Nucleotide Archive (ENA) at EMBL-EBI under the accession numbers ERP104502 (https://www.ebi.ac.uk/ena/browser/view/ERP104502) and ERP024546 (https://www.ebi.ac.uk/ena/browser/view/ERP024546). The raw 16S amplicon reads were filtered, processed and annotated with QIIME 2 [92] and DADA2 [93]. Abundance and taxonomy for each ASV cluster was acquired. The obtained abundance table with all samples was filtered, as singleton and unique counts were removed from the data, as suggested by Berry & Widder (2014) [25]. Filtering unique and singletons resulted in 48 ASVs. For the visualization module, UMAP parameters were set as follows: number of neighbors of 5, minimum distance of 0.10, number of components of 2 and an Eucledian metric. In the case of HDBSCAN the parameters were: minimum cluster size of 5, minimum sample size of 3 and Bray-Curtis metric. Network construction and network analyses were performed as described in previous sections. The raw data from the kombucha database is in the github repository so that the main results can be replicated and the user could interact with them in the dashboard.

Case study: Archean Domes
A 16S amplicon dataset was provided from a highly diverse microbial community named Archean Domes. This dataset comes from a microbial mat located in the Cuatro Ciénegas Basin (CCB), Coahuila, Mexico (coordinates 26˚49'41.7'' N, 102˚01'28.7'' W). The sampling used in this case study, which consists of ten samples along a 1.5 m transect, represents a natural community with more than 6,000 ASVs [80]. Compositional data was acquired as raw reads form 16S amplicon sequencing. Reads were filtered and processed for clustering and taxonomic annotation in QIIME 2 platform, as shown by the authors [80]. Singletons and unique counts were subsequently filtered as suggested, and consequently, 2,600 ASVs remained [25]. The ASV abundance matrices along with a taxonomic annotation for these sequences were used as input for the MicNet toolbox. For the visualization module, UMAP and HDBSCAN parameters were set as follows: number of neighbors of 15, minimum cluster size of 15, and minimum sample size of 5. Network construction and analyses were implemented as described in previous sections.

Validations
The enhanced SparCC. To validate that the modifications performed to SparCC did not affect its performance, we ran our version of SparCC on the dataset provided by Friedman & Alm (2012) [17]. We compared our estimated correlation with their true basis correlation (Fig 4A and 4B). We found an overall RMSE of 0.08, and a consistent value of RMSE when estimating small and large correlation values from the simulated samples (Fig 4C). Thus, although the original pipeline of SparCC was not modified, by implementing several techniques that parallelized different parts of the code, our implementation of SparCC can now be used for large databases in a reasonable amount of time with relatively small RMSE.
We then moved on to characterize the execution times and RAM consumption of the new SparCC version in comparison to the previous one. In Fig 5 we show that for relatively small datasets, that is, those containing less than 1,000 OTUs, both versions do similarly in terms of execution times, with the previous version of SparCC being slightly faster. However, for larger datasets, the execution time of the algorithm increases in an exponential fashion, taking approximately 9 hours to run the largest dataset we had of 2,600 OTUs. In comparison, the execution times of the new version of SparCC scales better to larger datasets, with an execution time of around 2.5 hours for the 2,600 OTUs dataset. The consumption of RAM memory is more less constant in both versions, but there is a higher RAM consumption in the new version of SparCC as a result of the parallelization of some processes of the algorithm. However, we can see that the RAM consumption does not grows exponentially with the size of the dataset and it has the advantage that by parallelizing some processes the execution time is significantly reduced.
Biological validation: The kombucha consortium. To demonstrate how MicNet tools could infer ecological associations, we used a kombucha data set to replicate main global and local behavior. Kombucha is a simple and well-studied microbial consortium of bacteria and yeast, which grow as biofilm due to cellulose production from acetic acid bacteria (AAB) [91,94], but also develops as a liquid consortium. This consortium has been suggested as a convenient tractable system, whose general cooperative and antagonistic multi-species interactions have been previously described [94][95][96]. After filtering singletons and unique taxa from the raw data, only 48 ASVs remained in the analysis, corresponding to five annotated bacterial taxa and three fungal taxa.
As MicNet pipeline suggests, first, the community was visualized and analyzed with UMAP and HDBSCAN to uncover global patterns and noise taxa. Clusters from HDBSCAN showed one main group containing almost all ASV (34 of 48), a small group with 12 ASV and 2 ASV classified as noise shown in purple (Fig 6C). This could refer to a close-interacting community where highly stratified interactions are not common. Since the kombucha community has similar compositions in both homogeneous liquid and biofilms [91], physical closeness between all organisms is expected; this was reflected in the formation of one main group with the HDBSCAN algorithm. We then obtained the co-occurrence matrix of the kombucha samples using our modified version of SparCC. S2 Table shows the main metrics of the kombucha network and Fig 6A and 6B show the resulting co-occurrence network.
The resulting co-occurrence network was indeed a relatively connected one, with a connectance of 0.23. This is reflected in the high average degree, which indicates that each ASV is related on average with around 10 ASVs out of the 48 that are present in the community. Highly connected networks could point to a more homogenous environments, including liquid consortiums and slightly stratified biofilms in which kombucha develops, as opposed to more stratified environments, such as soil and microbial mat communities. The kombucha consortium is the result of a metabolic interplay between its microbial consortium, and though cheaters and antagonistic relationships are known [97], more cooperative relationships have been reported [94,95]. In fact, the network did show slightly more positive (55%) than negative relationships (45%), and a major contribution due to mutualist or commensalist interactions is expected.
We perform a topology comparison analysis to explore if the kombucha consortium fits within three canonical networks. Based on metrics bootstrapping from comparable networks, the kombucha network's degree standard deviation, average path length, SW index suggest

PLOS ONE
some degree of small-worldness [55] (Fig 7A). Additionally, comparison of kombucha CCDF with these simulated networks' CCDF suggest that the degree distribution of kombucha is not following particularly any specific canonical topology, although it appears to fit best with a random network (Fig 7B). These results are consistent, as kombucha medium is a non-stratified environment when it develops as liquid consortium, explaining the random properties, but still a stratified microbial consortium when developed as biofilm, explaining the scale-free properties. Thus, kombucha community plausibly shows properties in which microorganisms are adapted to biologically interact with each other in both homogeneous or heterogeneous (to some degree) structures. Thus, real data rarely conforms to a single mathematical topology, but comparing the data's metrics and degree distribution can hint towards one or another structure and give us an idea of which metrics are better to discern between topologies.
Although some biological interactions in the kombucha consortium still need to be confirmed, the global interplay between AAB and yeasts is well-known. In kombucha fermentation, yeast produce invertase which cleave sucrose into glucose and fructose, and further use fructose to produce ethanol. Ethanol is a noxious compound for the consortium. Hence, as a mechanism to regulate ethanol concentrations in the media, AAB transforms glucose and ethanol into gluconic and acetic acid, respectively, exhibiting a straightforward case of syntrophy [91,94]. This biological interplay was depicted in the ASV classified as Zygosaccharomyces baillii, the most abundant yeast in the sample, and the ASV with the greatest number of interactions (S3 Table). Second to Z. bailii, an ASV corresponding to Komagataeibacter europaeus (an AAB) appears to be a key central taxa based on each centrality metric. According to the inferred correlations by the enhanced SparCC (correlation matrices at the genus and species level are provided in the in S1 File), the Z. bailii correlated to every other ASVs, and mainly positively co-occurring with Komagataeibacter (0.1812), and negatively correlating with Cortinarius (-0.2081). Actually, for the species within the Komagataeibacter show the highest positively mean correlation with Z. bailii, particularly Komagataeibacter europeus (0.5397), reflecting the cooperative interplay between yeast and AAB. Additionally, Komagataeibacter, an AAB genus, produces acetic acid which inhibits growth of other species, except for Z. bailii [91,98]. More specifically, it has been reported that K. rhaeticus is one of the main producers of acetic acid compared to other microbial species [99]. As expected, Komagataeibacter genus shows negative interactions against some of other species different from their own genus, such as Variovorax that is negatively correlated, which might be explained by with its growth-inhibiting capability.
From mean relationships within the taxa present in kombucha, ASVs from the same species tend to have more positive relations between themselves, and this was reflected in the community clustering analysis, where we found that communities were appreciably grouped per species, according to the Louvain method (S2 Fig). Clustering resulting from phylogenetic relatedness is common in microbial data, and it may reflect niche overlapping to some degree [4,31]. From the 5 communities predicted with the Louvain method, one of them is considered as noise as it consists of just 1 ASV. HDBSCAN group composition is variable, as most ASV belong to just one group (S3 Fig). Nonetheless, the smaller group with 12 ASV's from HDBSCAN is particularly interesting, as it includes most of K. europaeus and the Z. bailii ASV, probably depicting the core syntrophic interactions. This potential core group is similarly shown as a Louvain method group. Main metrics for each group of the community analysis (via Louvain or HDBSCAN groups) are reported in S4 and S5 Tables.
Another aspect of kombucha interactions is that even though yeast could be an important player in the metabolic interplay with AAB, AAB are not fully dependent on them for substrates, characterizing their interaction as some class of non-strict parasitism [96]. In the co-occurrence network, we found evidence that Z. bailii was indeed considered a key organism given its high centrality metrics, but in the percolation analysis where nodes were removed by degree centrality (beign Z. bailii), there was not a breakdown of the network (nor in the network density, the average degree, the number of components or the number of communities, as shown in S6 Table, along with other percolation analyses performed). In contrast, percolation analysis where the nodes are removed depending on their genus exhibit a network breakup of several components when most of Komagataeibacter nodes were removed. Even by removing six Komagataeibacter nodes, the network is disrupted into three components, further supporting the relevance of Komagataeibacter in the kombucha network. Percolation analysis and centrality metrics are consistent in positioning the Komagataeibacter as a crucial genus to the community, and this can be biologically understood due to 1) their independence (to some degree) from yeast to thrive, 2) their cellulose production capability (as a mechanism for protection and resource storage [94]), and 3) as regulators on ethanol concentration.

Case study: Archean Domes microbial mats
To further evaluate the performance of MicNet as a high-throughput toolbox capable of analyzing a highly diverse and complex environment, we tested it on a compositional dataset of ten samples from a microbial mat in Cuatro Ciénegas, Mexico, in the Chihuahuan Desert [80]. This microbial mat, The Archean Domes, thrives in a fluctuating hypersaline pond which has been previously described as hyperdiverse [80,100]. Like every microbial mat, it is a stratified community with an intricate metabolic interplay between their organisms [101]. Espinosa-Asuar et al. (2021) provided us their microbial data set of 6,063 ASVs, as they have reported. To begin the pipeline, the abundance matrix for all ASV was filtered to exclude low abundance and unique ASVs, remaining 2,600 amplicon sequence variants for the analysis.
First, we search for global patterns and local clustering within the ASV abundance matrix with UMAP and HDBSCAN. The community was grouped into 49 groups, with a mean of 50 nodes in each group (Fig 8C). With this approach, the HDBSCAN algorithm allowed us to https://doi.org/10.1371/journal.pone.0259756.g008 categorize organisms as noise or outliers, as 553 ASV did not group with any cluster and were categorized as noise, and 260 behaved as outliers.
Afterwards, we computed the correlation matrix with the modified version of SparCC. Main large-scale characteristics of the network are shown in S7 Table. With the 2,600 ASV we recreated a network with 463,609 interactions with all of them aggregated in a big but sparsely connected network (density = 0.137) (Fig 8A and 8B). This value is consistent with the interaction density of an antagonist network between 37 Gammaproteobacteria strains isolated from water samples of Churince (reported density = 0.14), a lake situated also in the CCB [102], further suggesting the fact that microbial mats' networks are often sparse [1,31].
Network topology comparison for the Archean Domes network display how high-complexity systems do not fit greatly simplified theoretical models. This was shown in the metrics bootstrapping from simulated comparable networks, where most Archean Domes metrics fall in between a scale-free and small-world network distributions (S4 Fig). Particularly, degree standard deviation, modularity, SW index, and clustering coefficient suggest that Archean Domes possess intermediate properties between scale-free and small-world networks. On the other hand, the average path length of 1.86 is typical of a scale-free network. These results show that this community is not randomly assembled, and that as expected from microbial data, the network is likely an intermediate between a scale-free and a small-world network [31,55].
Potential key taxa analysis based on node centrality was performed for this high-complexity network. An ASV corresponding to a bacterium from the order MSBL9, class Phycisphaerae, phylum Planctomycetes, exhibit the highest centrality values, regardless of the centrality measure employed (S8 Table). This bacterial class has been previously described in hypersaline microbial mats as anoxic, fermenting, halotolerant and halophilic microorganisms [103,104]. Looking at other top centrality nodes, most of them were associated to unassigned sequences, except for a node member of the class Parcubacteria and another one from the genus Imperialibacter (S8 Table). Moreover, unknown taxa as central nodes further suggest the relevance of ''microbial dark matter" in ecosystems, and particularly, in hypersaline environments like the one studied here [105].
Local correlations between nodes at different taxonomic levels was inspected (correlation matrices at the phylum and species level are provided in the S1 File). From the 463,610 total relationships in the network, we only highlight some of them which might be insightful. At the phylum level, ASVs from Proteobacteria, the most abundant phylum in the sample, have positive mean correlations with most of the phyla, except Nanoarchaeota, Dependentiae and other unassigned prokaryotes. Cyanobacteria, a key phylum in microbial mats, on the other hand face more negative mean correlations with other phyla, including Synergistetes, Acetothermia and Chloroflexi. These negative correlations prospectively originate from overall antagonistic interactions or different niche requirements (such as wavelength, carbon metabolism or temperature adaptation for potential chloroflexi-cyanobacteria associations [106]). Lower taxonomic associations could be inspected. For example, the most central hub taxa (unassigned MSBL9, class Phycisphaerae) positively co-occurs the most (0.5903) with a bacterium from the genus Dehalobium, while negatively co-occurs the most with a deltaproteobacteria from the Syntrophobacteraceae (-0.6177).
Although most ecological associations (particularly biological interactions) are analyzed by pairs, species also associate and interact involving more than two organisms, which are vital for ecosystem diversity [107]. Triad motif identification and structural balance theory attempts to address this issue in microbial networks. For the Archean Domes network, we identify that most of the triads are balanced, with a highly balanced triad fraction of 0.9995. One property of structural balanced networks is group division, wherein all intergroup ties are negative, and all intragroup ties are positive [108]. Potentially applied to microbial ecology, we suggest that structural balance could reflect niche differentiation, in conjunction with other network metrics and properties that have been previously described as useful [1,4,109].
Furthermore, we carried out a community analysis to inspect subnetwork properties. Our two clustering methods, Louvain and HDBSCAN groups, display contrasting results, which could reflect different ecological structures. Louvain grouping algorithm resulted in 7 network communities with mean total nodes of 371.43 and mean density of 0.4269, while HDBSCAN showed 50 clusters with mean total nodes of 40.94 and mean density of 0.73. Main metrics for each group, for each grouping algorithm, are shown in S9 and S10 Tables. Phyla composition within most groups (either Louvain or HDBSCAN) is highly diverse, which could mirror spatial structures or compartments at different scales [3,59], which is consistent with the stratified structure of microbial mats [101]. S5 and S6 Figs display the phyla composition for Louvain and HDBSCAN clustering methods, accordingly.
High-diversity communities are commonly associated with an overall stable state. While the high number of ASV in Archean Domes probably reflects a highly diverse system, novel methods to assess ecosystem stability have been suggested. In microbiomes, positive relationships alone are prone to destabilize microbial networks, as they can create highly dependent and vulnerable feedback loops [3,110]. Archean Domes co-occurrence network show a negative:positive ratio of 0.94, thus, positive and negative relationships are evenly present, suggesting an ecologically resilient, resistant and stablished community. Within this scheme, negative relationships in the community might be originating from antagonistic competitive taxa, lower abundance of facilitative taxa (mutualists), divergent niche requirements, or a combination of all of them [1,3]. Modularity further bolster the stability hypothesis within the microbial mat. Modular groups (or clusters), whether product of biological interactions or habitat preference, plausibly aid in the stability of the system, as fewer links between groups likely ameliorate the spread of local perturbations to other groups. Modularity in Archean Domes shows a high value of 0.34, higher than the kombucha network and other published biological networks [3]. Low average path length (1.86 in Archean Domes network) likely function as a measure of response capacity to disturbances, hinting about the ecological resilience capabilities of this microbial mat [4,31]. Finally, Percolation analysis delves deeper into community stability. We performed random, centrality and phylum percolation simulations, from which the resulting metrics are shown in S11 Table. While none of the node removals induce the breakup of the giant component, the total number of communities and overall modularity do perceive the effects of network perturbations. With this in mind, some degree of ecological robustness could be reflected by the impact of percolation simulations to the co-occurrence network.

Discussion
MicNet toolbox has shown to be a promising pipeline. Our new implementation of the SparCC algorithm allows larger datasets to be processed without overflowing the RAM in a reasonable time. Furthermore, UMAP and HDBSCAN, relatively new dimension reduction and clustering techniques, are promising in microbial ecology studies since, as suggested in this work, they are useful methods to identify metabolic groups, niche overlapping, or subcommunities. Finally, given the potential of processing co-occurrence networks with a graph theory approach, we have included several network analyses, both new and commonly used, to further describe and understand the resulting networks from SparCC.
One important aspect to have in mind when using the MicNet toolbox is sample size. UMAP and HDBSCAN are known to be quite sensitive to the size of the database used. With a very small dataset (less than 50 OTUs/ASVs) we recommend being cautious at the interpretation level. Dalmaijer et al. (2020) [78] have suggested that for optimal use of UMAP and HDBSCAN, it is recommended to have around 20-30 data points per expected cluster or subgroup. Furthermore, the choice of parameters for these two techniques should be done with careful consideration, in particular the number of nearest neighbors and minimum distance [65]. We hope that the interactive dashboard will help in this aspect, since parameters can be modified in a simple way.
In terms of network topology, not all large-scale metrics of a network should be used to discern between topologies. As we show previously, some metrics, such as diameter, have no use to discern topologies, whereas others, like the small-word index provide useful information [55]. Moreover, it is unlikely that any biological system follows exactly a single topology, as shown by our case study: Archean Domes, the hyperdiverse microbial mats in CCB. However, we believe that knowing whether a network tends more towards a random, scale-free or small-world could give insightful pointers about its general behavior. For example, a small-world model (created with the Watts-Strogatz algorithm) suggests that the network will have short path lengths, because it is formed by highly connected clusters, which are weakly connected among each other; whereas a scale-free model (created with the Barabasi-Albert algorithm) suggests that networks will have short average path lengths as well, as a consequence that certain nodes that have very high degree and can act as hubs; in both cases the average distance between nodes would be expected to be small but for different reasons, which could give us insight of key biological network properties [85].
Microbial communities in the light of complexity have shown, once again, the potential in drawing biological conclusions from networks. Nonetheless, the biological interpretation of UMAP, HDBSCAN and network analysis should also be taken cautiously. As we mentioned before, the interpretation of the different metrics obtained is debatable, and there is a high diversity of interpretations and terms (see synonyms in Table 1) used for the same ecological concepts. As for today, researchers are encouraged to correlate network theory metrics to biological significance in an attempt to find fundamental metrics that could be useful to describe a particular biological phenomenon in whichever microbial system. With MicNet we suggest an analysis pipeline including visualization, co-occurrence network creation and postprocessing of the resulting network with graph theory analyses that could be used as a standard method for network analysis, offering an overview of a microbial community and enabling the comparison between different microbial systems. Potentially, this approach promises to aid in the search for biologically fundamental metrics.
Biological validation with a kombucha consortium was accomplished, as known local and global behavior, including key taxa and interspecific biological interactions empirically confirmed elsewhere [91,94], were reproduced by the proposed toolbox. Moreover, our high-complexity case study, Archean Domes, displays the scope and usefulness of MicNet toolbox by deconstructing microbial co-occurrence networks to manageable biological knowledge. Aware of current caveats of the limitations microbial co-occurrence networks have [1,26], we restate that this approach should be taken as a roadmap for further research on the microbiome system, rather than a conclusive analysis. For example, directed studies to Phycisphaerae bacteria (and other central taxonomic groups) can be performed, and consequently, assess the relevance of this taxon to the whole community structure and functioning. Similarly, ''microbial dark matter" characterization and relevance could be further explored with the increasing technologies and databases [111]. Directed co-culture experiments and other novel strategies such as microdroplets are fundamental for biological interactions' validation [31], which could be applied to inferred correlations between taxa of interest. Moreover, module aimed experiments, including synthetic microbial communities, are tractable strategies that, although reducing complexity of the system, could be informative about mid-scale structures crucial to the system's stability [112], especially if the experiments include perturbations [113].
Given its potential usefulness, understanding both global and local patterns in microbial communities may be a wise strategy to delve deeper into their currently unknown properties. With the introduction of the MicNet toolbox, we hope that the research community will be able to implement several existing and new analysis techniques in a straightforward manner to further keep unravelling the intricate conundrums that microbiomes hold.  Table. Centrality measures from potential key players in kombucha network. Degree, closeness, betweenness, and PageRank centrality was calculated for the top 5 ASV respectively. If a given ASV was not among the top 5 in a centrality measure, the value is reported as NA. (PNG) S4  Table. Centrality measures from potential key players in Archean Domes network. Degree, closeness, betweenness, and PageRank centrality was calculated for the top 10 ASV respectively. If a given ASV was not among the top 10 in a centrality measure, the value is reported as NA. (PNG) S9 Table. Community analysis (Louvain groups) metrics on Archean Domes network. For each community, total edges, total nodes, average degree, clustering coefficient, and density were calculated. Clusters are ordered by increasing density. (PNG) S10 Table. Community analysis (HDBSCAN groups) metrics on Archean Domes network. For each community, total edges, total nodes, average degree, clustering coefficient, and density were calculated. Clusters are ordered by increasing density. (PNG) S11